perm filename NF.DOC[4,KMC]1 blob sn#149664 filedate 1975-03-12 generic text, type T, neo UTF8
00100	COMMENT ⊗   VALID 00003 PAGES
00200	C REC  PAGE   DESCRIPTION
00300	C00001 00001
00400	C00002 00002	Specifications and features for new FRONT for PARRY.
00500	C00010 00003	***** DATA STORAGE FORMAT *********
00600	C00015 ENDMK
00700	C⊗;
     

00100	Specifications and features for new FRONT for PARRY.
00200		Some are already done in the old FRONT.
00300	Is there any reason that the following can't be done in SAIL as easily as MLISP?
00400	BILL  Historical reasons:
00500		Lisp's arbitrary data structures
00600			Especially good for retaining record of transformations performed.
00700		Lisp's interpreter for debugging
00800	
00900	***** PHILOSOPHY **********
01000	
01100	Treat SYNONM, IRREG, IDIOM, SPATS, & CPATS in uniform manner.
01200	No more clear distinction between levels of processing
01300		(input, re-spelled, canonized, pattern, λ####).
01400	The following description sounds like a production system.
01500	
01600	***** MATCHING ALGORITHM **********
01700	
01800	Let's see what Terry W. has been thinking about along these lines.
01900	
02000	Single look-up program finding longest match.  Try to start match at front of
02100	input, but sometimes skip ahead to work on the middle.  Skip when the longest
02200	candidate in the table still has words left to match, but the input differs.
02300	Use remainder of pattern to choose from multiple meanings for following
02400	word in input.
02500		e.g.	how's she doing
02600			HOW IS she doing
02700			how BE she doing
02800			how be THEY doing
02900			how be NURSES doing
03000			how be NURSE doing
03100			how be nurse DO
03200			λ1234
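
The worked example above can be reproduced with a small sketch.  This is only
an illustration under assumptions: the table contents, the function names, and
the "restart from the front after every replacement" policy are all
hypothetical, and it does not use the remainder of a pattern to choose among
multiple meanings.

```python
# Hypothetical rewrite table: a tuple of input words maps to its replacement.
# The entries mirror the worked example; "λ1234" is the concept number from it.
TABLE = {
    ("how's",): ("how", "is"),
    ("is",): ("be",),
    ("she",): ("they",),
    ("they",): ("nurses",),
    ("nurses",): ("nurse",),
    ("doing",): ("do",),
    ("how", "be", "nurse", "do"): ("λ1234",),
}


def longest_match(words, i, table):
    """Return (length, replacement) for the longest key starting at words[i], or None."""
    longest = max(len(key) for key in table)
    for n in range(min(longest, len(words) - i), 0, -1):
        replacement = table.get(tuple(words[i:i + n]))
        if replacement is not None:
            return n, replacement
    return None


def rewrite(sentence, table):
    """Apply longest-match replacements until none applies; return every intermediate step."""
    words = sentence.lower().split()
    history = [" ".join(words)]
    changed = True
    while changed:
        changed = False
        # Try the front of the input first, then skip ahead to later positions.
        for i in range(len(words)):
            hit = longest_match(words, i, table)
            if hit:
                n, replacement = hit
                words[i:i + n] = replacement
                history.append(" ".join(words))
                changed = True
                break
    return history


for step in rewrite("how's she doing", TABLE):
    print(step)
```

The returned history doubles as the "list of all transformations performed"
asked for later in these notes.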
03300	
03400	Look-up algorithm must provide for word dropping (& letter & segment ?)
03500	A few meanings constantly change (e.g. "they", "bug"?)
03600		Must call function to compute meaning or
03700		update stored data for lookup.
03800	
03900	"Window" internal workings.
04000	Accept multiple sentences at once.
04100	Have all intermediate steps available until after response given.
04200		Keep a list of all transformations performed.
04300	Allow additional information to be learned during interview.
04400		Might be hard to insert in permanent data storage, but could be stuck on.
04500	
04600	***** SUFFIXES & SPELLING **********
04700	
04800	See DICTIO & PREFIX & SUFFIX on [PAR,RCP] for data.
04900	
05000	Clean up characters on input.
05100	
05200	Use algorithm like present FRONT:
05300		1) Look up
05400		2) Suffix removal
05500		3) Re-spelling
05600	
05700	Make sure that suffix is legal on that class of word
05800	and what effect it has on the class. (i.e. part-of-speech → part-of-speech)
05900	
06000	Gain information from prefixes & suffixes.
06100		Replace suffix with part of speech, number, tense, etc.
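
A minimal sketch of the suffix check, assuming a tiny hypothetical dictionary
and suffix table (none of these entries come from DICTIO or SUFFIX).  Each
suffix rule names the class it may attach to, the class that results, and the
information gained.

```python
# Hypothetical dictionary: root -> word class.
DICTIONARY = {"walk": "VERB", "nurse": "NOUN", "quick": "ADJECTIVE"}

# Hypothetical suffix table:
# suffix -> list of (class it may attach to, resulting class, information gained).
SUFFIXES = {
    "ed": [("VERB", "VERB", {"tense": "past"})],
    "s": [("NOUN", "NOUN", {"number": "plural"}),
          ("VERB", "VERB", {"person": "third"})],
    "ly": [("ADJECTIVE", "ADVERB", {})],
}


def strip_suffix(word):
    """Return (root, resulting class, features), or None if no legal analysis exists."""
    # 1) Look up the word as it stands.
    if word in DICTIONARY:
        return word, DICTIONARY[word], {}
    # 2) Suffix removal, accepted only if the suffix is legal on the root's class.
    for suffix, rules in SUFFIXES.items():
        if not word.endswith(suffix):
            continue
        root = word[:-len(suffix)]
        root_class = DICTIONARY.get(root)
        for legal_class, new_class, features in rules:
            if root_class == legal_class:
                return root, new_class, features
    return None


print(strip_suffix("walked"))
print(strip_suffix("quicked"))
```

Here "quicked" is rejected because "ed" is not legal on an adjective, while
"quickly" would change class from ADJECTIVE to ADVERB.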
06200	
06300	For spelling correction, see SPELL.REG[UP,DOC]
06400		It does 1 extra, 1 missing, 1 wrong, 2 transposed.
06500	I will do:
06600		1 extra
06700		1 missing - add "E" at end (for suffixer)
06800		1 wrong - neighboring keys
06900			- "O" for "0"
07000			- "Y" for "I" at end (for suffixer)
07100		2 transposed
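
The four kinds of correction listed above might be generated as follows.  This
is a sketch only: the neighboring-key table is a hypothetical fragment, the
function names are made up, and the order in which candidates are tried is an
assumption.

```python
# Hypothetical keyboard-neighbor table; only a few keys are filled in.
NEIGHBORS = {"a": "qwsz", "s": "awedxz", "e": "wsdr", "r": "edft", "i": "ujko"}


def candidates(word):
    """Yield spellings one correction away from word, in the order listed above."""
    # 1 extra: delete each letter in turn.
    for i in range(len(word)):
        yield word[:i] + word[i + 1:]
    # 1 missing: only try adding "e" at the end (for the suffixer).
    yield word + "e"
    # 1 wrong: neighboring keys, "o" for "0", and "y" for "i" at the end.
    for i, ch in enumerate(word):
        replacements = NEIGHBORS.get(ch, "")
        if ch == "0":
            replacements += "o"
        if ch == "i" and i == len(word) - 1:
            replacements += "y"
        for repl in replacements:
            yield word[:i] + repl + word[i + 1:]
    # 2 transposed: swap each adjacent pair.
    for i in range(len(word) - 1):
        yield word[:i] + word[i + 1] + word[i] + word[i + 2:]


def correct(word, dictionary):
    """Return word if known, else the first known candidate, else None."""
    if word in dictionary:
        return word
    for cand in candidates(word):
        if cand in dictionary:
            return cand
    return None


words = {"nurse", "doctor", "happy", "like"}
print(correct("nusre", words), correct("d0ctor", words), correct("happi", words))
```

Counting the calls that return a corrected word versus None would give the
tally of corrections and failures.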
07200	
07300	Keep count of spelling corrections and failures.
07400	Handle run-on words.
07500	Treat prefixes & suffixes as run-on words?
07600	Try to identify misspelling after suffix removal.  How?
07700	
07800	***** PARTS OF SPEECH **********
07900	
08000	Higher level classification of words:
08100		NOUN		for (DOES x GET ALONG WITH y)
08200		VERB
08300		PRONOUN		for reflexives, at least
08400		ADJECTIVE	for (I BE x)
08500	Have a table (matrix) giving a λ#### for specific values of x & y.
08600	
08700	Retain HAVE, BE, & DO and change all others to "VERB".
08800	Change all nouns to "NOUN".
08900	Change all adjectives to "ADJECTIVE"
09000	Is it possible to distinguish NOUN-VERB-ADJECTIVE?
09100	
09200	***** PHRASES **********
09300	
09400	For disambiguation (= idioms) see Stone's papers.
09500	
09600	Want to temporarily replace a verb with "verb" , remove auxiliaries, then
09700		bring back specific verb.
09800	
09900	Use start-of-sentence marker to require a noun in front of some patterns.
10000	
10100	Verb phrase eliminator:  to turn all of:
10200		I GO
10300		I WENT
10400		I DID GO			into: I GO .
10500		I WAS GOING
10600		I HAVE GONE
10700		I HAVE BEEN GOING
10800	
10900	Also interrogatives:
11000		DID I GO
11100		HAVE I GONE			into: I GO ?
11200		WAS I GOING
11300		HAVE I BEEN GOING
11400	
11500	Also passives:
11600		I AM TAKEN
11700		I AM BEING TAKEN
11800		I WAS TAKEN			into: x TAKE I .
11900		I HAVE BEEN TAKEN
12000		I WAS BEING TAKEN
12100	
12200	Also passive interrogatives:
12300		AM I TAKEN
12400		AM I BEING TAKEN
12500		WAS I TAKEN			into: x TAKE I ?
12600		HAVE I BEEN TAKEN
12700		WAS I BEING TAKEN
12800	
12900	Modal verbs might also be removed:
13000		WILL
13100		CAN
13200		SHOULD
13300		OUGHT to
13400		NEED to
13500		WANT to
13600		BE ABLE to
13700		TRY to
13800	
13900	What are:
14000		SEEM to be
14100		LIKE to
14200	
14300	Have a pre-determined set of flags indicating what was removed.
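
A sketch of the verb phrase eliminator together with the flags, covering the
declarative, interrogative, and passive lists above but not the modals.  The
tables are hypothetical and filled in only for GO and TAKE, and the input is
assumed to be a one-word subject plus a verb phrase.

```python
# Hypothetical tables, filled in only for the verbs GO and TAKE used above.
AUXILIARIES = {"DO", "DID", "AM", "IS", "WAS", "HAVE", "HAS", "BEEN", "BEING"}
BE_FORMS = {"AM", "IS", "WAS", "BEEN", "BEING"}
VERB_FORMS = {
    "GO": ("GO", "base"), "WENT": ("GO", "past"),
    "GOING": ("GO", "progressive"), "GONE": ("GO", "participle"),
    "TAKE": ("TAKE", "base"), "TOOK": ("TAKE", "past"),
    "TAKING": ("TAKE", "progressive"), "TAKEN": ("TAKE", "participle"),
}


def eliminate_verb_phrase(sentence):
    """Reduce subject + verb phrase to canonical form and report what was removed."""
    words = sentence.upper().split()
    removed = [w for w in words if w in AUXILIARIES]
    # Assumption: exactly one subject word and one main verb remain.
    subject, verb = [w for w in words if w not in AUXILIARIES]
    root, form = VERB_FORMS[verb]
    flags = {
        "removed": removed,
        "form": form,
        # A leading auxiliary marks an interrogative.
        "question": words[0] in AUXILIARIES,
        # A participle after some form of BE marks a passive.
        "passive": form == "participle" and any(w in BE_FORMS for w in removed),
    }
    body = f"x {root} {subject}" if flags["passive"] else f"{subject} {root}"
    return body + (" ?" if flags["question"] else " ."), flags


print(eliminate_verb_phrase("I HAVE BEEN GOING"))
print(eliminate_verb_phrase("WAS I BEING TAKEN"))
```

Treating HAVE, BE, and DO purely as auxiliaries is a simplification; a real
version has to keep them when they are the main verb.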
14400	
14500	How will I gobble up a "noun phrase" in the input?
14600	
14700	***** OTHER **********
14800	
14900	Represent both "start of sentence" and "end of sentence" in patterns.
15000	What about "not"?
15100	Learn about unknown words from context.
15200		Patterns could specify restrictions on part of speech.
15300		Picking up Doctor's name is an instance of this?
15400	
15500	Ellipsis problems.  Try partial match with previous sentence(s) to discover
15600	missing parts.
15700	
15800	KEN..CAN PARRY PATTERN MATCH HIS OWN OUTPUT?
15900		Only if the Dr might also say it.
16000	HOW MANY AMBIGUITIES IS HE STILL MISSING?
16100		When they arise, patterns are included to distinguish important meanings.
16200	
16300	Write prototype program.  Make sure it can do everything FRONT does?!
16400	BILL Don't need to do everything FRONT does.  Only the common easy ones.
     

00100	***** DATA STORAGE FORMAT *********
00200	
00300	All jumbled together in one huge table?  Wastes space (or search time).
00400	Fixed (maximum) size entries (and values) use too much room.  This can be
00500	avoided by separating tables by size.  Then getting longest match
00600	requires searching tables for all sizes (starting at biggest) until found.
00700	Would a compromise get the best of both or the worst of both?  For example,
00800	just a few (= 2 or 3) tables for different sizes of data.
00900	Arbitrary size entries (and values) require finding special marker characters
01000	which separate stuff.  This makes binary search harder and much slower.
01100	
01200	TWO IDEAS FOR STORING THE FRONT END'S TABLES:
01300	
01400	***** 1)  SIMPLE AND LOTS OF DISK READS **********
01500	
01600	The tables are stored on disk.  Assuming each entry is composed of 36-bit words
01700	of 5 ascii chars each, store each entry packed in right next to the rest.
01800	The first word of each is either a number telling how many words the entry is,
01900	or a special word indicating a break between entries.
02000	
02100		In core is a table listing: 
02200		 1) the first entry on a disk record,
02300		 2) the address on the disk of the disk record.
02400	
02500	The lookup consists of a binary search thru the in-core table for the proper
02600	disk record, reading in that disk record in dump mode, and then a linear
02700	search of the disk record for the proper entry.
02800	
02900	If the table in core is too long, it can be cut in half by reading in 2 records
03000	from disk at a time and keeping table entries for only every other record.
03100	And the table can be halved again...
03200	
03300	This method takes one disk read per entry lookup, assuming the entire
03400	index table is in core.   This may be a problem, because each entry
03500	in the table is of arbitrary length, although realistically it may
03600	be possible to distinguish between two entries by the first five words.
03700	
03800	
03900	
04000	***** 2) CLEVER AND SPACE-SAVING AND PROGRAMMING CHALLENGE **********
04100	
04200		This looks more like a transition net implementation.
04300	
04400	Everything is in core.
04500	
04600	The first word of each entry is stored in an array.  The array entries consist
04700	of this first word of ascii, and a pointer to the array of second ascii words
04800	which can follow this particular first one.   etc.  The final pointer is flagged
04900	as a pointer to the λ or whatever.
05000	
05100	The lookup consists of binary searching the top-level array on the first ascii
05200	word to be matched.  The pointer to the next-level table is used, and again
05300	a binary search.   etc.
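
A sketch of the nested arrays, with Python strings of 5 characters standing in
for 36-bit words of 5 ascii chars.  The names and the sample entries are
hypothetical, and each array element here carries both a value slot and a
pointer to the next level, which is one possible reading of "the final pointer
is flagged".

```python
from bisect import bisect_left


def chunks(text, size=5):
    """Split an entry into pieces of 5 characters, like ascii packed in 36-bit words."""
    return tuple(text[i:i + size] for i in range(0, len(text), size))


def make_level(items):
    """Build one sorted array (keys, nodes) from (pieces, value) pairs."""
    groups = {}
    for pieces, value in items:
        groups.setdefault(pieces[0], []).append((pieces[1:], value))
    keys = sorted(groups)
    nodes = []
    for key in keys:
        value = None
        rest = []
        for tail, v in groups[key]:
            if tail:
                rest.append((tail, v))
            else:
                value = v  # the entry ends here: this is the flagged final pointer
        nodes.append((value, make_level(rest) if rest else None))
    return keys, nodes


def build(entries):
    """Turn {entry text: value} into nested sorted arrays."""
    return make_level([(chunks(text), value) for text, value in entries.items()])


def lookup(level, text):
    """One binary search per 5-character piece; return the stored value or None."""
    value = None
    for piece in chunks(text):
        if level is None:
            return None
        keys, nodes = level
        i = bisect_left(keys, piece)
        if i == len(keys) or keys[i] != piece:
            return None
        value, level = nodes[i]
    return value


tables = build({"how be nurse do": "λ1234", "how be": "B", "they": "C"})
print(lookup(tables, "how be nurse do"))
```

Entries that share leading words share the upper-level arrays, which is where
the space saving comes from.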
05400	
05500	This method is the most compact that REM and I could think of.  The access
05600	might be faster than the previous one.  One problem is programming.  The
05700	other one is setting up the data structure.  It is possible, but a fascinating
05800	task in itself.